EPPS 6302 Methods of Data Collection and Production

Assignment 4: Webscraping

  • To complete Assignment 4, start by using rvest_wiki01.R to scrape foreign reserve data from Wikipedia, modifying the code as needed to capture additional tables. Then search for government documents on govinfo.gov and use govtdata01.R to download ten documents (a download sketch follows this list).
  • In the report, describe any challenges encountered in the scraping process, such as variations in table structures on Wikipedia, anti-scraping measures on government websites, or issues with dynamic content loading. Evaluate the usability of the scraped data, noting any limitations like incomplete or inconsistent data.
  • For improvement, consider using proxy rotation to bypass anti-bot mechanisms, enhancing data-cleaning techniques to ensure consistency, and adopting more robust scraping tools to better handle complex or changing website structures.
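
A minimal sketch of the document-download step is below. The URLs are placeholders, not real document links; the actual links come from your govinfo.gov search results and from govtdata01.R.

# Hypothetical example: download a set of PDFs located via a govinfo.gov search
doc_urls <- c(
  "https://www.govinfo.gov/content/pkg/EXAMPLE-1/pdf/EXAMPLE-1.pdf",
  "https://www.govinfo.gov/content/pkg/EXAMPLE-2/pdf/EXAMPLE-2.pdf"
)
dir.create("govdocs", showWarnings = FALSE)
for (u in doc_urls) {
  download.file(u, destfile = file.path("govdocs", basename(u)), mode = "wb")  # "wb" for binary files
  Sys.sleep(2)  # pause between requests to stay polite to the server
}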
## Workshop: Scraping webpages with R rvest package
# Prerequisites: Chrome browser, Selector Gadget

# install.packages("tidyverse")
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.4.3     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
#install.packages("rvest")
library(rvest)

Attaching package: 'rvest'
The following object is masked from 'package:readr':

    guess_encoding
url <- 'https://en.wikipedia.org/wiki/List_of_countries_by_foreign-exchange_reserves'
#Reading the HTML code from the Wiki website
wikiforreserve <- read_html(url)
class(wikiforreserve)
[1] "xml_document" "xml_node"    
## Get the XPath using the Inspect Element feature in Safari, Chrome, or Firefox.
## In the Inspect (Elements) panel, look for the <table class=...> tag; the table element can stay collapsed.
## Right-click the table element, choose Copy --> XPath, and paste it into html_nodes(xpath = ).

foreignreserve <- wikiforreserve %>%
  html_nodes(xpath='//*[@id="mw-content-text"]/div[1]/table[1]') %>%
  html_table()
class(foreignreserve) # Why is the first column not scraped?
[1] "list"
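## Alternative to copying the XPath: use a CSS selector (e.g., found with SelectorGadget).
## This is a sketch and assumes the reserves table is the first "wikitable" on the page.
foreignreserve_css <- wikiforreserve %>%
  html_element("table.wikitable") %>%
  html_table()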
fores <- foreignreserve[[1]][, 1:8] # [[ ]] returns a single element directly, without retaining the list structure.


# Rename the columns with shorter, code-friendly names
names(fores) <- c("Country", "Forexreswithgold", "Date1", "Change1","Forexreswithoutgold", "Date2","Change2", "Sources")
colnames(fores)
[1] "Country"             "Forexreswithgold"    "Date1"              
[4] "Change1"             "Forexreswithoutgold" "Date2"              
[7] "Change2"             "Sources"            
head(fores$Country, n=10)
 [1] "China"        "Japan"        "Switzerland"  "India"        "Russia"      
 [6] "Taiwan"       "Saudi Arabia" "Hong Kong"    "South Korea"  "Singapore"   
# Is the Sources column useful?
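# If not, one option is to drop it (a sketch; fores_trimmed is a new object, the original fores is kept)
fores_trimmed <- select(fores, -Sources)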

## Clean up variables
## What type is Date?

# Convert Date1 variable
fores$Date1 = as.Date(fores$Date1, format = "%d %b %Y")
class(fores$Date1)
[1] "Date"
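# The reserve columns are typically scraped as character strings with commas and
# footnote markers; a minimal cleaning sketch, assuming they came through as character:
fores$Forexreswithgold    <- readr::parse_number(fores$Forexreswithgold)
fores$Forexreswithoutgold <- readr::parse_number(fores$Forexreswithoutgold)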
write.csv(fores, "fores.csv", row.names = FALSE) # or use data.table::fwrite()?

# Second table on the same page (currency columns)

foreignreserve1 <- wikiforreserve %>%
  html_nodes(xpath='//*[@id="mw-content-text"]/div[1]/table[2]') %>%
  html_table()
class(foreignreserve1) # html_table() again returns a list
[1] "list"
fores1 <- foreignreserve1[[1]][, c("USD", "EUR", "JPY", "GBP", "CAD", "RMB", "AUD", "CHF")] # [[ ]] returns a single element directly, without retaining the list structure.


# Columns were selected by name above, so this rename keeps the same names
names(fores1) <- c("USD", "EUR", "JPY", "GBP", "CAD", "RMB", "AUD", "CHF")
colnames(fores1)
[1] "USD" "EUR" "JPY" "GBP" "CAD" "RMB" "AUD" "CHF"
write.csv(fores1, "fores1.csv", row.names = FALSE) # or use data.table::fwrite()?
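
# For the "additional tables" part of the assignment, every wikitable on the page can
# also be pulled in one call and then inspected by position (a sketch):
all_tables <- wikiforreserve %>%
  html_elements("table.wikitable") %>%
  html_table()
length(all_tables)      # how many tables were captured
# str(all_tables[[2]])  # inspect an individual table before cleaning it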